컴퓨터 비전으로 전환하기: 왜 CNN인가?

컴퓨터 비전으로 전환하기

오늘 우리는 단순하고 구조화된 데이터를 기본적인 선형 계층을 사용해 다루는 방식에서, 고차원 이미지 데이터를 다루는 방식으로 전환합니다. 단일 색상 이미지는 표준 아키텍처가 효율적으로 처리할 수 없는 상당한 복잡성을 초래합니다. 시각 정보에 대한 딥러닝은 특수한 접근법이 필요합니다: 컨볼루션 신경망(CNN)입니다.

1. 완전 연결 네트워크(FCN)가 실패하는 이유

FCN에서는 모든 입력 픽셀이 다음 레이어의 모든 뉴런과 연결되어야 합니다. 고해상도 이미지에서는 이로 인해 계산량이 폭발적으로 증가하게 되며, 과적합이 극심해져 학습이 불가능하고 일반화 성능이 낮아집니다.

Input Dimension: A standard $224 \times 224$ RGB image results in $150,528$ input features ($224 \times 224 \times 3$).
Hidden Layer Size: If the first hidden layer uses 1,024 neurons.
Total Parameters (Layer 1): $\approx 154$ million weights ($150,528 \times 1024$) just for the first connection block, requiring massive memory and compute time.

CNN의 해결책

CNN은 이미지의 공간 구조를 활용하여 FCN의 확장성 문제를 해결합니다. 작은 필터를 사용해 패턴(예: 경계선 또는 곡선)을 식별함으로써 파라미터 수를 수십만 배 이상 줄이고, 모델의 강건성을 높입니다.

TERMINALbash — model-env

> Ready. Click "Run" to execute.

PARAMETER EFFICIENCY INSPECTOR Live

Run comparison to visualize parameter counts.

Question 1

What is the primary benefit of using Local Receptive Fields in CNNs?

Filters only focus on a small, localized region of the input image.

It allows the network to process the entire image globally at once.

It ensures all parameters are initialized to zero.

It removes the need for activation functions.

Question 2

If a $3 \times 3$ filter is applied across an entire image, what core CNN concept is being utilized?

Kernel Normalization

Shared Weights

Full Connectivity

Feature Transposition

Question 3

Which CNN component is responsible for progressively reducing the spatial dimensions (width and height) of the feature maps?

ReLU Activation

Pooling Layers (Subsampling)

Batch Normalization

Challenge: Identifying Key CNN Components

Relate CNN mechanisms to their functional benefits.

We need to build a vision model that is highly parameter efficient and can recognize an object even if it slightly shifts its position in the image.

Step 1

Which mechanism ensures the network can identify a feature (like a diagonal line) regardless of where it is in the frame?

Solution:
Shared Weights. By using the same filter across all locations, the network learns translation invariance.

Step 2

What architectural choice allows a CNN to detect features with fewer parameters than an FCN?

Solution:
Local Receptive Fields (or Sparse Connectivity). Instead of connecting to every pixel, each neuron only connects to a small, localized region of the input.

Step 3

How does the CNN structure lead to hierarchical feature learning (e.g., edges $\to$ corners $\to$ objects)?

Solution:
Stacked Layers. Early layers learn simple features (edges) using convolution. Deeper layers combine the outputs of earlier layers to form complex, abstract features (objects).